Development and Validation of Deep Learning Models for the Multiclassification of Reflux Esophagitis Based on the Los Angeles Classification

This study aimed to evaluate, for the first time, the feasibility of deep learning (DL) models in the multiclassification of reflux esophagitis (RE) endoscopic images according to the Los Angeles (LA) classification. The images were divided into three groups, namely, normal, LA classification A + B, and LA C + D. The images from the HyperKvasir dataset and Suzhou hospital were divided into the training and validation datasets at a ratio of 4 : 1, while the images from Jintan hospital served as the independent test set. CNN- or Transformer-based models (MobileNet, ResNet, Xception, EfficientNet, ViT, and ConvMixer) were trained by transfer learning via Keras. The models were visualized using Gradient-weighted Class Activation Mapping (Grad-CAM). In both the validation set and the test set, the EfficientNet model showed the best performance as follows: accuracy (0.962 and 0.957), recall for LA A + B (0.970 and 0.925) and LA C + D (0.922 and 0.930), Macro-recall (0.946 and 0.928), Matthews correlation coefficient (0.936 and 0.884), and Cohen's kappa (0.910 and 0.850), which was better than the other models and the endoscopists. Based on the EfficientNet model, the Grad-CAM was plotted and highlighted the target lesions on the original images. This study developed a series of DL-based computer vision models with the interpretable Grad-CAM to evaluate the feasibility in the multiclassification of RE endoscopic images. It suggests, for the first time, that DL-based classifiers show promise in the endoscopic diagnosis of esophagitis.


Introduction
Gastroesophageal reflux disease (GERD) is a condition in which gastroesophageal reflux leads to esophageal mucosal lesions and troublesome symptoms [1,2]. It is classified into reflux esophagitis (RE), with mucosal injuries, and nonerosive reflux disease (NERD), with symptoms only [3]. Recently, the prevalence of RE has increased in Eastern Asia, due to the westernized lifestyle and diet [4,5]. The severe complications of RE include ulcer bleeding and strictures. Even though RE-induced death is rare, these severe complications are associated with significant morbidity and mortality rates [6].
Deep learning (DL) is a statistical learning method that empowers computers to extract features of raw data, including structured data, images, text, and audio, without human intervention. The remarkable progress of DL-based artificial intelligence (AI) has reshaped various aspects of clinical practice [9]. DL presents a significant advantage in the field of computer vision for analyzing medical images and videos that contain vast quantities of information [10]. In gastroenterology, AI is increasingly being integrated into computer-aided diagnosis (CAD) systems to improve lesion detection and characterization in endoscopy [11]. To the best of our knowledge, there have been no previous reports concerning the application of DL in the endoscopic classification of RE.
In this multicenter retrospective study, we aimed to evaluate the feasibility of DL models in the multiclassification of RE endoscopic images, according to the LA classification.

Datasets.
Subjects who underwent upper endoscopy between 2015 and 2021 were recruited from two hospitals as follows: (1) Suzhou: The First Affiliated Hospital of Soochow University and (2) Jintan: Affiliated Hospital of Jiangsu University. In the two centers, subjects were excluded if they had (1) esophagitis of other etiologies, e.g., pill-induced, eosinophilic, radiation, and infectious esophagitis; (2) esophageal varices; or (3) esophageal squamous cell cancer. This study was approved by the Ethics Committee of The First Affiliated Hospital of Soochow University and conducted in accordance with the Helsinki Declaration of 1975 as revised in 2000 (the IRB approval number 2022098). All participants signed statements of informed consent before inclusion. In addition, Z-line endoscopic images were obtained from an open dataset, HyperKvasir, which is currently the largest dataset of gastrointestinal endoscopy (https://datasets.simula.no/hyper-kvasir/) [12]. The dataset offers labeled/unlabeled/segmented image data and annotated video data from Baerum Hospital in Norway. The characteristics of the datasets are shown in Figure 1. Each endoscopic image of the Z-line was assessed and labeled as normal, LA classification A + B (LA A + B), or LA classification C + D (LA C + D) by three highly experienced endoscopists, based on the LA classification. The endoscopic devices in our hospital include Olympus GIF-Q260, GIF-H290, and Fuji EG-601WR, while those in the HyperKvasir dataset include Olympus and Pentax devices at the Department of Gastroenterology, Baerum Hospital.

CNNs-Based Architectures.
Pretrained convolutional neural networks (CNNs) include convolutional layers, average pooling layers, and fully connected layers, with ReLU activation. In addition, two dense layers (ReLU activation) and one dense layer (Softmax activation) were added on top of the pretrained CNN layers, which serve for feature extraction, as shown in Figure 2(a).

Transformer-Based Architectures.
The Transformer is characterized by synchronous input based on the self-attention mechanism. The Transformer encoder consists of three main components, namely, input embedding, multihead attention, and feed-forward neural networks. As with the CNNs, three dense layers (ReLU or Softmax activation) were added on top of the pretrained Transformer-based architectures.

Training and Validation
Implementation.
The CNN- or Transformer-based models were trained by transfer learning via Keras (with TensorFlow as the backend). The models were compiled with the Adam optimizer and the categorical cross-entropy loss function and trained with a fixed learning rate of 0.0001 and a batch size of 32. The code for the training procedure is available at https://osf.io/4tdhu/?view_only=b279429b6a284ad885da7cad79126df7.
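The configuration described above can be sketched as follows, assuming TensorFlow 2.x with its bundled Keras API. The text states only the optimizer, loss, learning rate, batch size, and the three added dense layers; the EfficientNetB0 variant, the ImageNet weights, the hidden-layer widths (256 and 64), and the names `CONFIG` and `build_classifier` are illustrative assumptions, not the authors' code.

```python
# Hyperparameters stated in the text; entries marked "assumed" are hypothetical.
CONFIG = {
    "input_shape": (331, 331, 3),  # images are rescaled to 331 x 331 pixels
    "num_classes": 3,              # normal, LA A + B, LA C + D
    "learning_rate": 1e-4,         # fixed learning rate of 0.0001
    "batch_size": 32,
    "hidden_units": (256, 64),     # assumed widths of the two ReLU dense layers
}


def build_classifier(config=CONFIG):
    """Pretrained backbone followed by two ReLU dense layers and a softmax layer."""
    # Imported lazily so that CONFIG can be inspected without TensorFlow installed.
    import tensorflow as tf

    # Assumed backbone: EfficientNetB0 with ImageNet weights and global average pooling.
    backbone = tf.keras.applications.EfficientNetB0(
        include_top=False,
        weights="imagenet",
        input_shape=config["input_shape"],
        pooling="avg",
    )

    layers = [backbone]
    for units in config["hidden_units"]:
        layers.append(tf.keras.layers.Dense(units, activation="relu"))
    layers.append(tf.keras.layers.Dense(config["num_classes"], activation="softmax"))
    model = tf.keras.Sequential(layers)

    # Adam optimizer and categorical cross-entropy, as stated in the text.
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=config["learning_rate"]),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

One caveat for anyone adapting this sketch: the `tf.keras.applications` EfficientNet models contain their own input rescaling and expect pixel values in the 0 to 255 range, so any separate normalization step has to be reconciled with the backbone implementation that is actually used.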

Target Training.
Endoscopic images of the Z-line were saved in JPEG format. All images were rescaled to 331 × 331 pixels, and the pixel values were then normalized from the range 0 to 255 to the range 0 to 1. Based on the LA classification, the images were divided into three groups, namely, normal, LA A + B, and LA C + D. Images from the HyperKvasir dataset and Suzhou hospital were divided into the training and validation datasets at a ratio of 4 : 1. The flowchart of the study is shown in Figure 2(b).
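The pixel normalization and the 4 : 1 split described above can be illustrated with a short standard-library sketch. The text does not say whether the split was stratified by class or which random seed was used; both are assumptions here, and the function names and the toy label list are hypothetical.

```python
import random
from collections import defaultdict


def normalize_pixels(pixels):
    """Map 8-bit pixel values (0 to 255) to floats in the range 0 to 1."""
    return [value / 255.0 for value in pixels]


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_indices, val_indices), splitting each class at train_fraction."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # assumed seed, for reproducibility only
    train, val = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        val.extend(indices[cut:])
    return sorted(train), sorted(val)


# Hypothetical label list: 10 images per class, for illustration only.
labels = ["normal"] * 10 + ["LA A + B"] * 10 + ["LA C + D"] * 10
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 24 6
```

Splitting within each class keeps the three groups at the same 4 : 1 proportion in both subsets, which matters when the classes are imbalanced.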

External Test.
A total of 600 endoscopic images (in JPEG format) from Jintan hospital served as the external test set, including 300 normal, 200 LA A + B, and 100 LA C + D (Figure 2(b)). The endoscopic devices in Jintan hospital include Olympus GIF-Q260 and GIF-H290.

Comparison with Endoscopists.
To further evaluate the performance of the models, the images from the test dataset were classified by two endoscopists (a junior endoscopist with five years of endoscopic experience and a senior endoscopist with more than ten years of experience).

Visualization of the Model.
The models were visualized using Gradient-weighted Class Activation Mapping (Grad-CAM) [13]. Grad-CAM uses the class-specific gradient information flowing into the final convolutional layer of CNN-based architectures to create maps of the key areas in the images without retraining. Based on the best multiclassification model, Grad-CAM was used to offer an inferential explanation on the original images.
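The core Grad-CAM computation described above can be sketched in plain Python. This is a toy illustration rather than the authors' implementation: in practice the feature maps and gradients come from the final convolutional layer via automatic differentiation, and the heatmap is then upsampled and overlaid on the image. The function name `grad_cam` and all numbers below are made up for the example.

```python
def grad_cam(feature_maps, gradients):
    """Combine K feature maps (each H x W) into one H x W Grad-CAM heatmap.

    feature_maps[k][i][j]: activation of channel k at position (i, j).
    gradients[k][i][j]: gradient of the class score with respect to that activation.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])

    # Channel weight = global average of that channel's gradients.
    weights = [
        sum(sum(row) for row in grad) / (height * width) for grad in gradients
    ]

    # Weighted sum of the feature maps across channels.
    heatmap = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, feature_maps):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * fmap[i][j]

    # ReLU keeps only the regions that increase the class score.
    return [[max(0.0, value) for value in row] for row in heatmap]


# Toy example with 2 channels of size 2 x 2 and made-up numbers.
maps = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
print(grad_cam(maps, grads))  # [[1.0, 0.0], [0.0, 2.0]]
```

In the toy example the second channel has a negative weight, so the positions it activates are suppressed to zero by the final ReLU.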

Performance in the Validation Set.
The confusion matrices of the six models in the validation set are shown in Figure 3(a). The EfficientNet model showed the highest accuracy of 0.962, followed by the ConvMixer model (0.950) and Xception (0.938). Moreover, the recalls for LA A + B and LA C + D of the EfficientNet model were 0.970 and 0.922, while its Macro-recall was 0.946. In terms of multiclass metrics, its Matthews correlation coefficient (MCC) and Cohen's kappa were also the highest (0.936 and 0.910).

Performance in the Test Set.
The confusion matrices in the test set are shown in Figure 3(b). The EfficientNet model still presented the best performance. Its accuracy was 0.957, followed by ConvMixer (0.943) and Xception (0.936) (Table 1). Moreover, the recalls for LA A + B and LA C + D of the EfficientNet model were 0.925 and 0.930, while its Macro-recall reached 0.928, better than the other models. In terms of multiclass metrics, its MCC and Cohen's kappa were still the highest (0.884 and 0.850).
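The metrics reported above can all be derived from a confusion matrix. The sketch below uses the standard textbook definitions (macro-averaged recall over all classes, the multiclass form of MCC, and Cohen's kappa); it is not taken from the authors' code, their averaging conventions may differ, and the 3 × 3 matrix is a made-up toy example rather than study data.

```python
import math


def multiclass_metrics(cm):
    """cm[i][j] = number of images of true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    true_counts = [sum(cm[k]) for k in range(n)]
    pred_counts = [sum(cm[i][k] for i in range(n)) for k in range(n)]

    # Per-class recall and its unweighted mean over the classes.
    recalls = [cm[k][k] / true_counts[k] for k in range(n)]
    macro_recall = sum(recalls) / n

    # Sum over classes of (true count x predicted count), shared by MCC and kappa.
    cross = sum(t * p for t, p in zip(true_counts, pred_counts))

    # Multiclass Matthews correlation coefficient.
    mcc = (correct * total - cross) / math.sqrt(
        (total**2 - sum(p * p for p in pred_counts))
        * (total**2 - sum(t * t for t in true_counts))
    )

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = correct / total
    expected = cross / total**2
    kappa = (observed - expected) / (1 - expected)

    return {
        "accuracy": observed,
        "recalls": recalls,
        "macro_recall": macro_recall,
        "mcc": mcc,
        "kappa": kappa,
    }


# Toy confusion matrix (rows: true class, columns: predicted class).
toy_cm = [[9, 1, 0], [1, 8, 1], [0, 2, 8]]
metrics = multiclass_metrics(toy_cm)
print({name: value for name, value in metrics.items() if name != "recalls"})
```

Unlike accuracy, MCC and Cohen's kappa account for the class distribution of both the labels and the predictions, which is why they are useful on an imbalanced test set.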

Comparison with the Endoscopists.
In the test set, the junior endoscopist presented an accuracy of 0.916, recalls of 0.885 for LA A + B and 0.840 for LA C + D, a Macro-recall of 0.863, an MCC of 0.820, and a Cohen's kappa of 0.780 (Table 1).

Discussion
This study proposed a series of multiclassification computer vision models with the interpretable Grad-CAM to evaluate the feasibility of DL in the endoscopic images of RE, according to the LA classification. Six CNN- or Transformer-based models were developed, and the EfficientNet model showed practicable performance, better than the endoscopists.
In 1999, Lundell et al. [8] developed the LA classification to describe the mucosal appearance in endoscopy and to assess its correlation with the clinical changes in patients with RE. It was developed for the purpose of stratifying the clinically relevant severity of RE. According to the LA classification, type A is defined as one (or more) mucosal break, no longer than 5 mm, that does not extend between the tops of two mucosal folds; type B is defined as one (or more) mucosal break, more than 5 mm long, that still does not extend between the tops of two folds; type C is defined as one (or more) mucosal break that is continuous between the tops of two or more mucosal folds but which involves no more than 3/4 of the esophageal circumference; type D is defined as one (or more) mucosal break that involves more than 3/4 of the circumference [8]. According to the Japan 2021 guideline [3] and the ACG 2021 guideline [7], RE is classified into mild RE (grade A or B of the LA classification) and severe RE (grade C or D), in which the latter was defined as the high grade of RE, based on the Lyon Consensus [1]. The stratification is essential to the detailed diagnosis and the decision-making of therapy [3]. Thus, in this study, we labeled the images and trained the multiclassification models based on the aforementioned guidelines.
AI is being widely applied in a variety of clinical settings with the aim of improving the management of gastrointestinal diseases [14]. DL is a subset of machine learning that can automatically extract features of input data via artificial neural networks, organized as CNNs and Transformers [15]. The past five years have witnessed a series of studies assessing the performance of DL in the diagnosis of esophageal diseases [16-22]. The main application is the computer vision task, consisting of the detection and segmentation of lesions in esophageal endoscopic images or videos [23,24].
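The LA grading rules and the mild/severe grouping summarized earlier in this section can be written as a small decision function. This is only a sketch of the criteria as worded here: the thresholds follow the text (5 mm, 3/4 of the circumference), while the function names, the argument names, and the idea of passing pre-measured values are hypothetical, since in the study the labels were assigned by endoscopists from the images.

```python
def la_grade(has_break, max_break_mm=0.0, spans_fold_tops=False,
             circumference_fraction=0.0):
    """Return 'normal', 'A', 'B', 'C', or 'D' from the criteria as worded in the text."""
    if not has_break:
        return "normal"
    if not spans_fold_tops:
        # Breaks confined to single folds: graded by length only.
        return "A" if max_break_mm <= 5 else "B"
    # Breaks continuous between fold tops: graded by circumferential extent.
    return "C" if circumference_fraction <= 0.75 else "D"


def study_group(grade):
    """Map an LA grade to the three groups used for model training."""
    groups = {
        "normal": "normal",
        "A": "LA A + B",  # mild RE
        "B": "LA A + B",  # mild RE
        "C": "LA C + D",  # severe RE
        "D": "LA C + D",  # severe RE
    }
    return groups[grade]


# Made-up measurements, for illustration only.
print(study_group(la_grade(True, max_break_mm=4)))  # LA A + B
print(study_group(la_grade(True, max_break_mm=12, spans_fold_tops=True,
                           circumference_fraction=0.9)))  # LA C + D
```

Collapsing the four grades into two disease groups mirrors the mild versus severe stratification in the guidelines cited above.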
The CAD system is designed to detect and differentiate lesions based on the mucosal/vascular pattern, to stratify the progression of the diseases, or to assist the decision-making of therapy [20,25,26]. Its remarkable advantage is reducing the workload of endoscopists and improving diagnostic accuracy [27,28].
Recently, Visaggi et al. [29] performed a meta-analysis concerning machine learning in the diagnosis of esophageal diseases. Their review included a total of 42 studies, of which nine focused on Barrett's esophagus and three on GERD [30-32]. In terms of DL, Ebigbo et al. [33] developed a real-time endoscopic system to classify normal Barrett's esophagus and early esophageal adenocarcinoma, which showed an accuracy of 89.9%. Similarly, a CAD system by de Groof et al. [19] was used to improve the detection of dysplastic Barrett's esophagus. The ResNet/UNet-based system showed high-accuracy detection and near-perfect segmentation, better than general endoscopists. More recently, Tang et al. [17] trained a multitask DL model to diagnose esophageal lesions (normal vs. cancer vs. esophagitis). According to their report, the model achieved a high accuracy (93.43%) in complex classification, as well as a satisfactory coefficient (77.84%) in semantic segmentation. Guimaraes et al. [18] proposed a CNN-based multiclassification model (normal vs. eosinophilic esophagitis vs. candidiasis). In the test set, the model presented a fine global accuracy (0.915), higher than that of endoscopists.
In this multicenter study, six CNN- or Transformer-based computer vision models were applied via transfer learning to the multiclassification of RE endoscopic images, according to the LA classification. The EfficientNet model displayed the highest accuracy and Macro-recall. EfficientNet is a CNN architecture and scaling method that uniformly scales all dimensions with a set of fixed scaling coefficients [34]. Various designs aim to improve training efficiency, e.g., Transformer blocks in Transformer-based models, but the expensive overhead that grows with parameter size remains an issue. EfficientNetV2 is the successor of EfficientNet, which is a family of models optimized for floating point operations and parameter efficiency. In 2021, Google combined training-aware neural architecture search and scaling to further optimize the training speed and parameter efficiency and to develop this new family [35]. EfficientNetV2 overcomes some of the training bottlenecks and outperforms the V1 models. Moreover, compared with Transformer-based models, EfficientNet shows an advantage on this small dataset with limited computing power. In the comparison with the endoscopists, the EfficientNet model also showed advantages in both accuracy and recall.
Interpretability is one of the essential aspects of a DL model. Computer scientists and medical practitioners are increasingly concerned about how AI reaches its inferences during the development of models, especially in the field of computer vision. Therefore, we used Grad-CAM to visualize an inferential explanation on the original images.
Our study has some limitations. First, we focused only on esophagitis caused by reflux, rather than on various etiologies, e.g., radiation, eosinophilic, and pill-induced esophagitis. Further studies, based on medical history and biopsy, are required to develop more complex classifiers for esophagitis. Second, the image dataset was limited, and video files were not included in the analysis; this study still requires more data for validation. Lastly, we did not deploy the models in endoscopic devices. We believe that this study may contribute to future deployment in actual practice.

Conclusions
In this study, we developed a series of DL-based computer vision models with the interpretable Grad-CAM to evaluate the feasibility of AI in the multiclassification of RE endoscopic images for the first time. It suggests that DL-based classifiers show promise in the endoscopic diagnosis of esophagitis. In the future, it is necessary to investigate multimodal fusion in the classification of RE, integrating endoscopic images, clinical symptoms, esophageal pH monitoring, etc.

Data Availability
The dataset used to support the findings of the study is from HyperKvasir, which is now the largest open dataset of gastrointestinal endoscopy (https://datasets.simula.no/hyper-kvasir/).

Ethical Approval
This study was approved by the Ethics Committee of The First Affiliated Hospital of Soochow University (the IRB approval number 2022098). All procedures performed in studies involving human participants were in accordance with the Helsinki Declaration of 1975 as revised in 2000. All subjects gave written informed consent before participation.

Conflicts of Interest
The authors declare that they have no conflicts of interest.